Cut Claude Code token usage 82-97% with local LLMs.
Claude Code's Max plan quota can vanish in 19 minutes. A single screenshot costs ~15,000 tokens; one DOM snapshot costs ~114,000. Retry loops keep burning tokens until the quota runs out, and Claude Code has no built-in detection for them -- the #1 pain point (666+ upvotes).
helix-agent is an MCP server that compresses screenshots, DOM, and browser output through your local GPU before Claude sees them -- and detects retry loops before they drain your quota. Connect it to Claude Code and savings happen automatically; no workflow changes needed.
| What | Without | With helix-agent | Reduction |
|---|---|---|---|
| Screenshot analysis | ~15,000 tokens | ~400 tokens | 97% |
| DOM/HTML processing | ~114,000 tokens | ~500 tokens | 99% |
| Browser automation | ~15,000 tokens/action | ~1,000-2,700 | 82-93% |
| Retry loops | Infinite (until quota dies) | Stopped at 3rd repeat | 100% |
| Routine tasks | Opus tokens ($$$) | Local LLM ($0) | 100% |
All compression runs on your local GPU via Ollama. Zero cloud API cost.
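The reduction column can be reproduced from the before/after token counts. The snippet below is only a sanity check of that arithmetic; `reduction_percent` is a hypothetical helper name, not part of helix-agent, and the browser row is split into its best and worst case.

```python
def reduction_percent(before_tokens: int, after_tokens: int) -> float:
    """Return the share of tokens saved, as a percentage of the original count."""
    if before_tokens <= 0:
        raise ValueError("before_tokens must be positive")
    return 100.0 * (before_tokens - after_tokens) / before_tokens


# Approximate token counts copied from the table above.
ROWS = {
    "screenshot": (15_000, 400),
    "dom": (114_000, 500),
    "browser (best case)": (15_000, 1_000),
    "browser (worst case)": (15_000, 2_700),
}

for name, (before, after) in ROWS.items():
    print(f"{name}: {reduction_percent(before, after):.1f}% fewer tokens")
```

The table rounds these values down to whole percentages.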
| | Without helix-agent | With helix-agent |
|---|---|---|
| Screenshot | 15,000 tokens raw image | 400 tokens structured text |
| DOM snapshot | 114,000 tokens raw HTML | 500 tokens action summary |
| Retry loop | Runs until quota dies | Stopped at 3rd repeat |
| Routine task | Opus ($$$) | Local Ollama ($0) |
| Cloud API cost | $50-200/month in waste | $0 |
git clone https://github.com/tsunamayo7/helix-agent.git
cd helix-agent && uv sync
ollama pull gemma4:e2b # 8GB GPU (or e4b/26b/31b for larger)
uv run python server.py

Add to ~/.claude/settings.json:
{
"mcpServers": {
"helix-agent": {
"command": "uv",
"args": ["run", "--directory", "/path/to/helix-agent", "python", "server.py"]
}
}
}

Restart Claude Code. Done.
$0 cloud cost. All compression, retry detection, and delegation runs on your local GPU via Ollama. No API keys, no subscriptions, no metered billing. Your tokens stay on your machine.
Claude Code (Opus)
|
+-- helix-agent (MCP server)
|
+-- vision_compress ---- Local LLM ----> ~400 tokens (was 15,000)
+-- dom_compress ------- Local LLM ----> ~500 tokens (was 114,000)
+-- retry_guard -------- Pure logic ----> Loop stopped (sub-ms)
+-- think / agent_task - Local LLM ----> $0 reasoning
+-- computer_use ------- agent-browser -> 82-93% saved
+-- code_review -------- 4-layer LLM --> $0.20 total
| Platform | GPU | Status |
|---|---|---|
| macOS (Apple Silicon) | Metal / M1-M4 | Tested daily |
| Linux | NVIDIA CUDA | Primary dev environment |
| Windows (WSL2) | NVIDIA CUDA | Supported via Ollama |
| Windows (native) | NVIDIA CUDA | Supported via Ollama |
| CPU-only | None | Works (slower, ~30s per compress) |
Anywhere Ollama runs, helix-agent runs. 8GB VRAM minimum for GPU acceleration.
- Vision Compress -- Screenshot to structured text via local vision LLM. 15,000 tokens to 400.
- DOM Compress -- HTML/DOM to structured extract via local LLM. 114,000 tokens to 500.
- Retry Guard -- Detects identical tool calls before they loop. Sub-millisecond, no LLM needed.
- GPU Auto-Detection -- Detects your GPU at startup, selects the optimal model from 8GB to 96GB+.
- Browser Automation -- Routes through agent-browser (Rust/CDP) with Playwright fallback. Native keyboard events fix React controlled components.
- 4-Layer Code Review -- gemma4 + Sonnet + Opus + Codex pipeline; surfaced 16+ findings in the author's benchmark run at ~$0.20.
- Self-Evolving Memory -- Reviews conversations every 5 turns, saves reusable skills as SKILL.md files. Gets smarter over time, all local.
- Parallel Tasks -- Run multiple tasks simultaneously with 2-axis model routing (task type x input size).
- ReAct Agents -- Local LLM delegation with tool access, sub-agents, background workers, and JSONL tracing.
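The Retry Guard behavior described in the list above (identical tool calls stopped at the third repeat, with no LLM involved) could be sketched roughly as follows. This is an illustration under assumptions, not helix-agent's actual code: the `RetryGuard` class, its method names, and the choice to count only consecutive identical calls are all hypothetical.

```python
import hashlib
import json
from collections import deque


class RetryGuard:
    """Hypothetical sketch: block a tool call repeated identically N times in a row."""

    def __init__(self, max_repeats: int = 3, window: int = 20) -> None:
        self.max_repeats = max_repeats
        # Only the most recent fingerprints are kept, so memory stays bounded.
        self._recent = deque(maxlen=window)

    @staticmethod
    def _fingerprint(tool: str, args: dict) -> str:
        # Sorting keys makes the hash independent of argument order.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def check(self, tool: str, args: dict) -> bool:
        """Record the call and return True if it should be blocked."""
        fingerprint = self._fingerprint(tool, args)
        self._recent.append(fingerprint)
        # Count how many of the latest calls are identical to this one.
        streak = 0
        for seen in reversed(self._recent):
            if seen != fingerprint:
                break
            streak += 1
        return streak >= self.max_repeats

    def reset(self) -> None:
        """Forget all recorded calls."""
        self._recent.clear()


guard = RetryGuard()
for attempt in range(1, 4):
    blocked = guard.check("browse", {"url": "https://example.com"})
    print(f"attempt {attempt}: blocked={blocked}")
```

Because the check is a hash plus a short scan over a bounded deque, it needs no model call, which is consistent with the sub-millisecond figure quoted above.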
MCP tools that delegate to local LLMs can be tricked into accessing sensitive files. PathGuard prevents this with strict path allowlists -- delegated tools can only read/write directories you explicitly permit.
Defends against CVE-2025-59536 (RCE and API token exfiltration through Claude Code project files).
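A path allowlist of this kind might be checked along these lines. This is a minimal sketch, assuming the allowlist is a comma-separated list of directories as in the `HELIX_ALLOWED_PATHS` example; `load_allowed_roots` and `is_path_allowed` are hypothetical names, and PathGuard's real implementation may differ.

```python
from pathlib import Path


def load_allowed_roots(env_value: str) -> list[Path]:
    """Parse a comma-separated allowlist into resolved absolute directories."""
    return [
        Path(part.strip()).expanduser().resolve()
        for part in env_value.split(",")
        if part.strip()
    ]


def is_path_allowed(candidate: str, allowed_roots: list[Path]) -> bool:
    """Return True only if the resolved path is an allowed root or lies inside one."""
    # resolve() collapses ".." segments and follows symlinks, so traversal
    # tricks are evaluated against the real target, not the literal string.
    resolved = Path(candidate).expanduser().resolve()
    return any(resolved == root or root in resolved.parents for root in allowed_roots)


roots = load_allowed_roots("/home/user/projects,/tmp")
print(is_path_allowed("/home/user/projects/app/main.py", roots))
print(is_path_allowed("/home/user/projects/../.ssh/id_rsa", roots))
```

Comparing against `resolved.parents` rather than using a string prefix check avoids accepting sibling directories such as `/home/user/projects-evil`.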
# PathGuard blocks unauthorized access automatically
HELIX_ALLOWED_PATHS=/home/user/projects,/tmp

helix-agent runs in production daily on the author's own Claude Code workflow:
- 367 tests passing (pytest, all Ollama calls mocked)
- 17+ hour autonomous sessions with retry guard preventing quota drain
- 27 MCP tools + 3 Resources + 3 Prompts -- full MCP spec coverage
- Used to build helix-pilot, helix-codex, and itself (dogfooding)
helix-agent auto-selects the best model for your hardware:
| Your GPU | VRAM | Model | Compress Speed |
|---|---|---|---|
| RTX 4060 | 8GB | gemma4:e2b | 10.2s |
| RTX 4070 Ti | 16GB | gemma4:e4b | 11.8s |
| RTX 4090 / 3090 | 24GB | gemma4:26b | 14.7s |
| RTX PRO 6000 | 48GB+ | gemma4:31b | 27.5s |
gemma4:e2b on 8GB runs 2.7x faster than 31b with comparable compression quality. No expensive GPU required.
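The VRAM-to-model mapping in the table could be expressed as a simple tier lookup. The sketch below is an assumption about how such a selection might work, not the project's detection code: `select_model` is a hypothetical name, VRAM is passed in rather than detected, and falling back to the smallest model below 8GB is a guess.

```python
# (minimum VRAM in GB, Ollama model tag), largest tier first; values from the table above.
MODEL_TIERS = [
    (48, "gemma4:31b"),
    (24, "gemma4:26b"),
    (16, "gemma4:e4b"),
    (8, "gemma4:e2b"),
]


def select_model(vram_gb: float) -> str:
    """Pick the largest model whose VRAM tier fits the available memory."""
    for min_vram_gb, model in MODEL_TIERS:
        if vram_gb >= min_vram_gb:
            return model
    # Assumption: with less than 8GB (or CPU-only), use the smallest model.
    return MODEL_TIERS[-1][1]


for vram in (8, 12, 16, 24, 96):
    print(f"{vram}GB -> {select_model(vram)}")
```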
+--------------+ +-----------------+ +--------------+
| Screenshot |---->| vision_compress |---->| ~400 tokens |
| (15K tokens) | | (local gemma4) | | (text only) |
+--------------+ +-----------------+ +--------------+
+--------------+ +-----------------+ +--------------+
| DOM / HTML |---->| dom_compress |---->| ~500 tokens |
| (114K tokens)| | (local gemma4) | | (text only) |
+--------------+ +-----------------+ +--------------+
Real measurement (RTX PRO 6000):
Input: 1920x1048 screenshot of X.com (~15,000 tokens)
Output: "X home feed, Japanese UI, 'For You' tab active..." (~400 tokens)
Saved: 7,362 tokens in one call
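To make the screenshot flow concrete, here is a rough sketch of sending an image to a local Ollama instance through its `/api/generate` endpoint and getting text back. The prompt wording and the function names `build_vision_request` and `compress_screenshot` are assumptions for illustration; the actual prompt and pipeline inside `vision_compress` are not shown in this README.

```python
import base64
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # Ollama's default local endpoint


def build_vision_request(image_bytes: bytes, model: str = "gemma4:e2b") -> dict:
    """Build an Ollama /api/generate payload carrying one base64-encoded image."""
    return {
        "model": model,
        # Hypothetical prompt; the project's real prompt is not documented here.
        "prompt": (
            "Describe this screenshot as concise structured text: "
            "page, visible UI elements, and active state."
        ),
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for a single JSON response instead of a stream
    }


def compress_screenshot(image_bytes: bytes, model: str = "gemma4:e2b",
                        host: str = OLLAMA_HOST) -> str:
    """Send the screenshot to the local model and return its text summary."""
    body = json.dumps(build_vision_request(image_bytes, model)).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

Calling `compress_screenshot(open("shot.png", "rb").read())` requires a running Ollama server with the model already pulled; only the returned text, not the image, would then be passed on to Claude.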
Automated multi-LLM review at ~$0.20 total:
| Layer | Reviewer | Findings | Cost |
|---|---|---|---|
| 1 | gemma4 + RAG (local) | 7 | $0 |
| 2 | Sonnet 4.7 | 14 | ~$0.13 |
| 3 | Opus 4.7 (summary only) | 16 | ~$0.03 |
| 4 | Codex (P1 only, on-demand) | 5 | ~$0.33 |
| Combined | | 16+ | ~$0.20 |
gemma4 + RAG ($0) outperforms Codex GPT-5.3 (~$0.33) in code review findings.
| Capability | helix-agent | Alternatives |
|---|---|---|
| Screenshot to text (97% cut) | Local vision LLM | No MCP server does this |
| DOM to text (99% cut) | Local LLM | Playwright MCP sends raw DOM |
| Retry loop detection | Sub-ms, no LLM | No built-in Claude Code detection |
| GPU auto-detect + model select | 8GB to 96GB+ | Manual config required |
| Self-evolving memory | SKILL.md + Qdrant | Unique to helix-agent |
| All 3 MCP primitives | 27 Tools + 3 Resources + 3 Prompts | Most MCPs implement Tools only |
27 tools organized by function:
| Category | Tools |
|---|---|
| Token saving | vision_compress, dom_compress |
| Loop prevention | retry_guard_check, retry_guard_status, retry_guard_reset |
| Local delegation | think, agent_task, fork_task, parallel_tasks |
| Vision & browser | see, browse, computer_use |
| Background agents | spawn_agent, send_agent_input, wait_agent, list_agents, close_agent |
| Memory | evolving_memory_review, list_learned_skills, get_skill, dept_search, dept_store |
| Code quality | code_review |
| Meta | providers, models, config, agent_types |
Plus 3 Resources (helix://status, helix://models, helix://config) and 3 Prompts (retry_report, optimize_tokens, setup_guide).
helix-agent works with zero configuration. For advanced setups:
# Environment variables (all optional)
OLLAMA_HOST=http://localhost:11434 # Ollama endpoint
HELIX_PROVIDER=ollama # LLM provider
HELIX_LOG_LEVEL=INFO # Logging level

Optional dependencies:
- Qdrant -- shared memory across sessions
- Playwright -- browser automation fallback
- agent-browser -- recommended for 82-93% browser token savings
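For reference, the three environment variables listed above could be read with fallbacks like this. It is a sketch only: `HelixConfig` and `load_config` are hypothetical names, and treating the example values as the defaults is an assumption (only the Ollama endpoint is a known default, from Ollama itself).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class HelixConfig:
    ollama_host: str
    provider: str
    log_level: str


def load_config(env=None) -> HelixConfig:
    """Read optional settings from the environment, falling back to assumed defaults."""
    env = os.environ if env is None else env
    return HelixConfig(
        # Strip a trailing slash so URL paths can be appended safely.
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/"),
        provider=env.get("HELIX_PROVIDER", "ollama"),
        log_level=env.get("HELIX_LOG_LEVEL", "INFO").upper(),
    )


print(load_config({"HELIX_LOG_LEVEL": "debug"}))
```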
| GPU VRAM | Command | Model Size |
|---|---|---|
| 8GB | ollama pull gemma4:e2b | 4GB |
| 16GB | ollama pull gemma4:e4b | 6GB |
| 24GB | ollama pull gemma4:26b | 12GB |
| 48GB+ | ollama pull gemma4:31b | 20GB |
- helix-pilot -- GUI automation MCP server
- claude-code-codex-agents -- MCP bridge to Codex CLI
- helix-sandbox -- Secure sandbox MCP server
helix-agent is an MCP server that Claude Code connects to. It does not wrap, proxy, or re-host Claude Code or the Anthropic API. Fully compliant with Anthropic's Terms of Service.
See CONTRIBUTING.md.
MIT
